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Abstract 

Variational Bayes (VB) is rapidly becoming a popular tool for Bayesian 
inference in statistical modeling. However, the existing VB algorithms are 
restricted to cases where the likelihood is tractable, which precludes the use 
of VB in many interesting situations such as in state space models and in 
approximate Bayesian computation (ABC), where application of VB methods 
was previously impossible. This paper extends the scope of application of VB 
to cases where the likelihood is intractable, but can be estimated unbiasedly. 

The proposed VB method therefore makes it possible to carry out Bayesian 
inference in many statistical applications, including state space models and 
ABC. The method is generic in the sense that it can be applied to almost all 
statistical models without requiring too much model-based derivation, which is 
a drawback of many existing VB algorithms. We also show how the proposed 
method can be used to obtain highly accurate VB approximations of marginal 
posterior distributions. 
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1 Introduction 


Let y be the data and 6^ G 0 the vector of model parameters. We are interested 
in Bayesian inference abont 0, based on the posterior distribntion nij)) = p{6\y) oc 
p{d)p{y\d), with p{9) the prior and p{y\9) the likelihood fnnction. In this paper, we are 


interested in Variational Bayes (VB), which is widely nsed as a compntationa ^ 


tive m ethod for approximating the posterior distribntion ti{9) flAttiasl . 


1999 


ly effec- 


Bishop 


20061 ). VB approximates the posterior by a distribntion q{9) within some tractable 


class, snch as an exponential family, chosen to minimize the Knllback-Leibler diver¬ 
gence between q{9) and vr(6'). Most of the VB algorithms in the literatnre reqnire 
that the likelihood p{y\9) can be compnted analytically for any 9. 

In many applications, however, the likelihood p{y\9) is intractable in the sense that 
it is infeasible to compnte p{y\9) exactly at each valne of 9, which makes it difScnl 


to ns e VB for inference. For example, in state space models (jPnrbin and Koopmanl. 


200lh . which are widely nsed in economics, hnance and engineering, the likelihood is 


a high dimensional integ r al ove r the state variables governed by a Markov process. 


Ghahramani and Hinton! fj2000l) were the hrst to nse VB for inference in state space 


models. However, they only consider the special case in which the time series is seg¬ 
mented into regimes with each regime assnmed to follow a linear-Ganssian state space 
model. F or general state spac e mode ls, it is still a challenging problem to do inference 


with VB. 


Turner and Sahanil (120111 ) discnss some of the difhcnlties in applying VB 


methods to time series models. Another example of a sitnatio n where implement¬ 


ing VB is difficnlt is approximate Bay esian compntation (ABC) (ITavare et al 


Marin et ah 


2012 : 


Peters et al. 


19971: 


2OI2I) . ABC methods provide a way of approximat¬ 


ing the posterior nij)) when the likelihood is difficnlt to compnte bnt it is possible to 
simnlate data from the model. We are not aware of any work that nses VB for in- 
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ference in ABC , although a closely rela t ed tec hnique called Expectation Propagation 


has been used flBarthelme and Chopinl . 


2014f) . This paper proposes a VB algorithm 


that approximates 7r{6) when the likelihood is intractable. The only requirement is 
that the intractable likelihood can be estimated unbiasedly. The proposed algorithm 
therefore makes it possible to carry out variational Bayes inference in many statistical 
models with an intractable likelihood, where this was previously impossible. 

In many models, by introducing a latent variable a, the joint density p{y,a\9) 
is tractable. This makes it much easier to work with the joint posterior p{6,a\y) oc 
p{9)p{y, (y\0) rather than the marginal posterior of interest 7r(0) itself. In this situation 
many VB algorithms in the literature approximate the joint posterior p{6,a\y) by a 
factorized distribution q{0)q{a), and then use q{9) as an approximation to 7i{9). The 
main drawback of this approach is that the (usually high) posterior d ependence be¬ 


tween 9 and a is ignored, which might lead to a poor VB approximation flNeville et al 


20141 ). Our VB algorithm approximates 7t{9) directly with the latent variable a inte¬ 


grated out and thus overcomes this drawback; see the example in Section 15.11 

Section [2] presents our approach, which we call Variational Bayes with Intractable 
Likelihood (VBIL), when the likelihood can be estimated unbiasedly. VBIL trans¬ 
forms the problem of approximating the posterior 7r{6) into a stochastic optimization 
problem using a noisy gradient. It is essential for the success of stochastic optimiza¬ 
tion algorithms to have a gradient estimator with a sufficiently small variance. Section 
[3] describes several techniques, including control variate and quasi-Monte Carlo, for 
variance reduction in estimati ng the gradien t. This section also discusses the impor¬ 


tance of the natural gradient flAmari 


19981) . which takes into account the geometry 


of the variational distribution q{6) being learned. 

Unlike many VB algorithms that are derived on a model-by-model basis and re¬ 
quire analytical computation of some model-based expectations, one of the main 
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advantages of VEIL is that it can be applied to almost all statistical models without 
requiring an analytical solution to model-based expectations. The only requirement is 
that we are able to estimate the intractable likelihood unbiasedly. The VEIL method¬ 
ology is therefore generic and widely applicable. As a by-product, VEIL provides an 
estimate of the marginal likelihood, which is useful for model choice. 

There are several lines of work rel ated t o ours in terms of working with an in¬ 


tractable likelihood. 


Eeanmont 


(1200811 and 


Andrieu and RobertsI fl2009[l show that 


Markov chain Monte Carlo simulation based on an unbiased estimator of the likeli¬ 
hood is still able to generate samples from the posterior. This method is known in the 
literature as Pseudo-Marginal Metropolis-Hasting (PMMH). More efficient variants 
of PM MH, called correlated PMMH and blqckwis e PMMH, have bee n proposed re¬ 


cently fiDeligiannidis et ah 


2015 


Tran et ah 


201611 . 


Tran et ah! (1201311 show that im¬ 


portance sampling with the likelihood replaced by its unbiased estimator is still valid 
for estimating expectations with respect to the posterior, and name their method as 
Impor tance Sampling Squared (IS^). Also, 


Duan and Fulop 


( 20l 4 and 


Tran et ah 


(I 2 OI 4 J ) use sequential Monte Carlo for inference based on an unbiased likelihood esti¬ 


mator. The main advantage of VEIL is that it is several orders of magnitude faster 
than these competitors. 

Section m studies the link between the precision of the likelihood estimator to the 
variance of the VEIL estimator. This helps to understand how much accuracy is 


lost when working with an estimated l i 


is available. In this spirit. 


celiho od co mpared to t he cas e the likelihood 


Pitt et al.l ( 2 OI 2 I) and 


Tran et ah 


( 2013 1 show that the 


asymptotic variance of PMMH and IS^ estimators increases exponentially with the 
variance of the likelihood estimator. Therefore, it is critical for these methods to have 
a likelihood estimator that is accurate enough. They show that the variance of the 
likelihood estimator should be around 1 in order to minimize the computing time that 
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is needed for the variance of PMMH/IS^ estimators to have a hxed precision. For 
VEIL, we show that the asymptotic variance of VEIL estimators increases linearly 
with the variance of the likelihood estimator. The proposed methodology is therefore 
useful in cases when only highly variable estimates of the likelihood are available. We 
discuss such a situation in Section 15.11 where VEIL works well while its competitors 
fail. 

Several interesting applications of VEIL are presented in Section |5l Section 15.11 
shows the use of VEIL for generalized linear mixed models and demonstrates the 
high accuracy of VEIL compared to the existing VE algorithms. Section 15.21 applies 
VEIL to Eayesian inference in state space models and Section 15.31 shows how VEIL 
can be used for AEG. To the best of our knowledge, our paper is the hrst to use a 
VE method in the most general way for Eayesian inference in state space models and 
AEG. Another interesting application of VEIL is presented in Section 15.41 in which 
we illustrate that VEIL provides an attractive way to improve the accuracy of VE 
approximations of marginal posteriors. 


2 Variational Bayes with an intractable likelihood 


This section describes the basic form of the proposed VEIL algorithm, where an 
unbiased estimator of the likelihood is available. Denote by PN{y\0) an unbiased 
estimator of the likelihood p{y\0). Here N is an algorithmic parameter relating to the 
precision in estimating the likelihood, such as the number of samples if the likelihood 
is estimated by importance sampling or the number of particles if the li kelihood in 


state s pace models is estimated by a particle filter. Using the terminology in 


Pitt et al 


(120121) . we refer to N as the number of particles. Let ^ = \ogpp^{y\9) — \ogp{y\9), 


so that PN{y\9) = p{y\9)e^, and denote by gN{z\9) the density of 2 ;. Note that ^ 
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is unknown as we do not know \ogp{y\9) and there is no need to compute z in 
practice, but, as will become clear shortly, it is very convenient to work with 2 ;. We 
sometimes write j)N{,y\9) as PN{,y\9i z). We note that J e^gN{z\9)dz = 1 because of the 
unbiasedness of the estimator PN{y\9)- Dehne the following density on the extended 
space 0 X M 


t^n{9,z) 


p{9)p{y\9)e^gNiz\9) 

p{y) 


= 7i{9)e''gN{z\9). 


This augmented density admits the posterior of interest 7r(0) as its marginal. It is use¬ 
ful to work with nN{9,z) as the high-dimensional vector of random variables involved 
in estimating the likelihood is transformed into the scalar A direct approximation 
of 71n{9,z) is q\^N{9,z) = qx{9)e^g]\f{z\9), where qx{9) is the variational distribution 
with the variational parameter A to be estimated, and then qx{9) can be used as an 
approximation of vr(0). However, it turns out that it is impossible to estimate the 
gradient of the Kullback-Leibler divergence between qx,N{9,z) and 7iixf{9,z) as this 
requires knowing x. 

We propose instead to approximate 71n{9,z) by qx,N{9,z) = qx{9)gixf{z\9). This 
augmented density has the attractive features that qx{9) is its marginal for 9 and it is 
possible to estimate the gradient of the Kullback-Leibler divergence KL(A) between 
Qx,n{9,z) and 7iixf{9,z) (c.f. ([2]) below). Although qx,N{9,z) does not provide a good 
approximation of the posterior marginal of z, the latter is not of interest to us. 
Furthermore, under Assumptions 1 and 2 given in Section 4, the minimization of 
KL(A) is equivalent to the minimization of the KL divergence between qx{9) and 
7r{9). 

The Kullback-Leibler divergence between qx^N{9,z) and 7iixf{9,z) is 


KL(A) = 


qx{9)gN{z\9) log 


q\{0)gN{z\9) 

71n{9,z) 


dzd9, 


( 1 ) 
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where we omit to indicate dependence on N for notational convenience. The gradient 
of KL(A) is 



= (vA[loggA(^)](loggA(^) - log{p{9)pN{y\0,z)^'^ . (2) 

Here, we have used the facts that VA[q'A(6')] =q'A(6')VA[loggA(^)] and that E(VA[loggA(6')]) = 
0. It follows from ([2]) that, by generating 9 rsj qx{9) and 2 ; gN{z\9), it is straightfor¬ 

ward to obtain an unbiased estimator VaKL(A) of the gradient VaKL(A). Therefore, 
we can use stochastic optimization to optimize KL(A). We note that the unknown z 
is implicitly generated when the unbiased likelihood estimator PN{y\9) =PN{y\9,z) is 
computed. In practice, is never dealt with explicitly and it only plays a theoretical 
role in the mathematical derivations. The basic algorithm is as follows 

Algorithm 1. • Initialize and stop the following iteration if the stopping 

criterion is met. 

• For f = 0,1,..., compute = —atVxKLlX^l). 

We will refer to this algorithm as Variational Bayes with Intractable Likelihood 
(VEIL). The sequence {at} should satisfy at>0, ^jat = cx) and <C)0. We choose 
at = l/(l-|-f) in this paper. It is also possible to train at adaptively. 

It is important to note that each iteration is parallelizable, as the gradient VaKL(A) 
is estimated by independent samples from qx,N{9,z). 
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2.1 Stopping criterion and marginal likelihood estimation 


An easy-to-implement stopping rule is to stop the updating procedure if the change 
between and e.g. i n term s of the Euclidean distance, is less than some 


threshold e flRanganath et ah 


20141 . However, it is difficult to select e as such a 


distance depends on the scales and the length of the vector A. It is easy to show that 


\ogp{y) = j log z)dzde + KL(A) > LB(A), (3) 


where 


LB(A) = j log qxA0,z)dzd9 

= E0,^[logp(0) - \ogqx{9) + \ogpN{y\9,z)] 


(4) 


is the lower bound on the log of the marginal likelihood log p{y). This lower bound 
after convergence can be used as an approximation to log p{y), which is useful for 
model selection purposes. The expectation of the first two terms in (0]) can be com¬ 
puted analytically, while the last term can be estimated unbiasedly by samples from 
Qx,n{9,z). However, in our experience, estimating the entire expectation (0]) based on 
samples from qx^N{9,z) leads to a smaller variance. Denote by LB(A) the resulting 
unbiased estimate of LB (A). Although LB (A) is strictly non-decreasing over itera¬ 
tions, its sample estimate LB(A) might not be. To account for this, we suggest to 
stop the updating procedure if the change in an averaged value of the lower bounds 
over a window of M iterations, LB(At) = (l/M)^^^LB(At_fc+i), is less than some 
threshold e. At convergence, the values LB(Ai) stay the same, therefore LB(At) will 
average out the noise in LB(Ai) and is stable. Furthermore, we suggest to replace 
LB(A) by a scaled version of it, LB(A)/n with n the size of the dataset such as the 
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number of observations. The scaled lower bound is more or less independent of the 
size of the dataset (c.f., Figure |3]). We set M = 5 and e = 10“^ in this paper. 


3 Variance reduction and natural gradient 

As is typical of stochastic optimization algorithms, the performance of Algorithm [1] 
depends greatly on the variance of the noisy gradient. This section describes several 
techniques for variance reduction. 


3.1 Control variate 

Denote h{9,z) = log {p{9)pp^{y\9,z)) for notational simplicity. Let 9s^qx{9) and Zg^ 
gN{z\9s), s = l,...,S, be S samples from the variational distribution q\^N{9,z). A naive 
estimator of the ith element of VaKL(A) is 

Va.KL(A)"“™ = ^ V VA,|log,,(«.)|(logto(e,) - (5) 

S=1 

whose variance is often too large to be useful. For any number Cj, consider 

VyKL(A) = -^^Vx^[logqx{9s)]{logqx{9s) -h{9s,Zs) - q), (6 ) 


which is still an unbiased estimator of VAiKL(A) since E(VA[loggA(6')]) =0, whose vari¬ 
ance can be greatly redu 
ered in the literature, see 


Paislev et ah 

(2012) 

. Nott et ah 

(2012 

) and 

Raneanath et ah 
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fj2014j) . The variance of VAiKL(A) is 


V(vIla(A)) = ^v(vAjloggAW](loggAW-h(0,z)-Q)) 

= ^V(^VAjloggA(0)](loggA(0) -h(d,z)}'j 

- ^cov(^VA,lloggx(d)](loggx(0) -h(0,z)),VxAioggx(0)]^ 

+ |v(VAjloggAW]). 

The optimal q that minimizes this variance is 

Ci = cov(^VxA^oggx(0)](loggx(0) -h(0,z)),Vx,[ioggx(0)]^ /v(^VAjloggA(^)])- (7) 


Then 

V(viyL(A)) = V(VI^L(A )“™)(1 - pf) < V(VI^L(A)“™), 

where pi is the correlation between VAjloggA( 6 ')] (logq'A( 6 ') —h( 6 ',; 2 )) and VAjlogq'A( 6 ')]. 
Often, pf is very close to 1. 

We estimate the nnmbers c* by samples {Os^Zg) ^gx,N{0,z) as in ([7]). In order to 
ensnre the nnbiasedness of the gradient estimator, the samples nsed to estimate q 
mnst be independent of the samples used to estimate the gradient. In practice, the 
Ci can be updated sequentially as follows. At iteration t, we use the c* computed in 
the previous iteration t — 1, i.e. based on the samples from q'A(‘-i),Ar( 6 ', 2 ;), to estimate 
the gradient VaKL(A^*^), which is estimated using new samples from g^it) j^{9,z). We 
then update the c* using this new set of samples. By doing so, the unbiasedness is 
guaranteed while no extra samples are needed in updating the numbers q. 

The gradient in the form of ([2]) can be written as a sum of two terms, where the 
hrst term E 5 )...,q,^( 0 )(VA[logq'A( 6 ')]loggA( 6 *)) can be in most cases computed analytically. 
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However, as pointed out by a referee, this term should be estimated using the same 
samples of 6 as we do in (|6]). Doing so helps to reduce the noises in estimating the 
gradient. This is because the hrst term plays the role of a con trol variate. This 


Salimans and KnowlesI (120131) . 


phenomenon is discussed in detail in 

3.2 Natural gradient 

Intuitively, a different learning rate should be used for each scale in the gradient 
vector. That is, the traditional gradient vector VaKL(A) should be multiplied by 
an appropriate scale matrix. It is well-known that the traditional gradient dehned 
on the Euclidean s pace does no t adequately capture the geometry of the variational 


distribution q\{0) flAmaril . 


19981) . A small Euclidean distance between A and X' does 


not ne c essari ly mean a small Kullback-Leibler divergence between qx{G) and qy{0)- 


Amaril (119981) dehnes the natural gradient as 

VaKL(A)“" = /f(A)-VaKL(A), 


( 8 ) 


with If{X) the Fisher information matrix, and suggests usi ng the natural gradien t as 


an efficient alternative to the traditional gradient. See also 


Hoffman et ah 


(129131 ) 


If the variational distribution q\{0) has the exponential family form 

®(9)=exp(r(9)'A-Z(A)), 


(9) 


with T{6) the vector of sufficient statistics and A the vector of natural parameters, 
then /j7’(A) = coVg^(T(6*),T(6*)) is computed analytically. 


The use of 

ffie natural gradient in VB a 

gorithms is considered, among others, bv 

Honkela et ah 

(201Q 

), 

Hoffman et ah 

(2Q13 

) and 

Salimans and Knowles 

(2013 

). We 
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demonstrate the importance of the natural gradient using a simple example where the 
likelihood is available. We consider a model where the data yi^B[lfi) - the Bernoulli 
distribution with probability 9. We generate n = 200 observations iji from i?(l,0 = O.3) 
and obtain k = Y^^yi = b7. We use a uniform prior on 6. Then, the posterior p{6\y) 
is Beta(/c + l,n —fc + l). The variational distribution qx{9) is chosen to be Beta(a,/5) 
which belongs to the exponential family with the natural parameter A = (a,/?)'. The 
Fisher information matrix If{^) is 


If{X) = 


ipi{a) - il)i{a + (3) '0i(a + /5) 

iJi{a + l3) +/ 3 ) 


where = Va;a;[logr(a;)] is the trigamma function. In this simple example, the 

gradient VaKL(A) in (|2]) can be computed analytically 


VAKL(A)=Ji.(A)A-hf(A) 


with 


H{X) 


k%lji{a) — n'lpi^a + (3) 

(n - k)'ipi{l3) - n'4)i{a + (3) 


Using the traditional gradient, the update in Algorithm [T] is 


yh+i) = yW _ at(^J^(A('))AW - //(A^)). 


Using the natural gradient, the update is 

A(t+i) = (1 _ 
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Figured] plots the densities of the exact posterior 7t{6) and the variational distributions 
qx{0) estimated by the VEIL using the traditional gradient and the natural gradient, 
with two different random initializations. The hgure shows that the VEIL algorithm 
using the natural gradient is superior to using the traditional gradient. Furthermore, 
the VEIL algorithm based on the natural gradient is insensitive to the initial 




Figure 1: Plots of the densities of the exact posterior and the variational approxima¬ 
tion estimates, at convergence, with two different starting values at random. 

3.3 Factorized variational distribution 

Often, the variational distribution qx{G) is factorized into K factors 


9a(S) = 


( 10 ) 


Then, each factor qx(k){0^^'^) is updated separatel y and the variance of 


of the corresponding g radient can be reduced. 


ASalimans and Knowles! (12013! ) and 


lie estimate 


Ranganath et al.l (1201411 consider variance reduction using factorization. Denote by 


hk{0,z) the terms in h{6,z) that involve only 6^^^ and z. From (|2|), and noting that 
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IE 0 , 2 (V;^(fe)[logg;^(fc)( 0 *^^))]) = 0 , the traditional gradient corresponding to factor k is 

V,<„KL(A) = (v,<«|log?«„(9<'=')](log<;j,«(9'‘>) - At(9,2))) . 

( 11 ) 

In the case qx{k){d^’^'>)=exp{Tk{9^^^)'X^^^ — belongs to an exponential family, 

the natnral gradient corresponding to factor k is 


V2.)KL(A)“"‘> = /f,j(A<‘')-'Vj,«KL(A), 


( 12 ) 


where lF,kiX^^^) is the Fisher information matrix of distribntion qxwiO^^^). 

Estimating the gradient using flTT]) has less variation than using (|2]). Intuitively, 
this is because the variation due to t e rms n ot involving has been removed. This 
is also explained in Ranganath et al.1 ( 2014 ) as a Rao-Blackwellization effect. 


3.4 Randomised quasi-Monte Carlo 


Numerical integration using quasi-Monte Carlo (QMC) has been proven efficient 
in many applications. Instead of generating uniform random numbers t/(0,l) as 
in plain Monte Carlo methods, QMC generates deterministic sequences that are 


more evenly c 
discrepancy. 


istributed in f 0 ,l) in the sense that they minimise the so-called star- 


Dick and Pillichshammerl (120101) provide an extensive background on 


QMC. It is shown that, in many cases, QMC integration achieves a better conver¬ 
gence rate than Monte Carlo integration. In this paper, we use randomised quasi- 
Monte Carlo (RQMC) as VEIL requires an unbiased estimator of the gradient. By 
introducing a random element into a QMC sequence, RQMC preserves t he low- 


discr e pancy property and, at the sam e time, leads to unbiased estimators (lOwenl . 


19971: 


Dick and Pillichshammerl . 


20101 ) 
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Here, we use RQMC to sample 9r^qx{9). This will help to reduce the variance of 
the noisy gradient if the dimension of 9 is not too high. Of course, one can also use 
RQMC in the likelihood estimation, but given some time constraint we do not pursue 
this idea in this paper. 


4 The effect of estimating the likelihood 


This section studies the effect of the variance of the noisy likelihood on the VEIL 
estimators, and provides guidelines for selecting the number of particles N. A large 
N gives a precise likelihood estimate and therefore an accurate estimate of A, but at 
a greater computational cost. A small N leads to a large variance of the likelihood 
estimator, so a larger number of iterations is needed for the procedure to settle down. 
It is therefore useful in practice to have some guidelines for selecting N. 


In order to understand the effect of estimating the likelihood, we follow 


Pitt et al 


(120121 ) and make the following assumption. 


Assumption 1. There is a function {9) >0 such that'E{z\9) = — -^^^ andY{z\9) = 
iHe) 


N 


More precisely. 


Pitt et al. 


(2012) assume further that 2 ^ ~ A/’(—in or- 


der to deri v e a th eory for selecting a n opt imal N. This assumption is justified in 

a 


Tran et al. 


(120 bt l and 


Doncet et al. 


(120151 ) making use of the unbiasedness of the 


likelihood estimate. The reason that the mean of z is — | times its variance is be¬ 
cause E(e^) = 1 in order for the likelihood estimator to be unbiased. 

Assumption 2. For a given cr^ > 0, let N be a function of 9 and such that 


Y{z\9)=a^, t.e. N = N^ 2 (9) = j\9)/. ThenE{z\9) = -^ andY{z\9)=a^ 
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Suppose that the equation VaKL(A) = 0, with KL(A) in ([T]), has the unique solution 
A*. Let Am be the estimator of A* obtained by Algorithm 1 or 2 after M iterations, and 
Am be the corresponding estimator obtained when the exact likelihood is available. 
Denote C*(^) = VA[loggA(6*)]and denote by E*(.) and V*(.) the expectation and 
variance operators with respect to q\*{9). For simplicity, we consider the case that A 


IS sea 


ar; the case with a multivariate A can be obtained using Theorem 5 of 


Sacks 


fll958l) . We obtain the following results whose proof is in the Appendix. 


Theorem 1. Suppose that Assumptions 1 and 2 are satisfied, and that the regularity 


conditions in Theorem 1 of 
(i) Then, 


Sack^ til 958 ) hold. 


Vm(Xm - A*) 4 ^^(o, CA*V(vIia(A*))), asM ^oo, (13) 


where c\* is a positive constant that depends only on geometric properties of the 
function VaKL(A*) and is independent of the random variables involving in estimating 
VaKL(A*), i.e. c\* is independent of . 

(a) Let (T^gy^(AM) =ca*V(VaKL(A*)) he the asymptotic variance of Xm as M— )-oo. 
Similarly, let a‘^gy^{XM) be the asymptotic variance ofXM- Then, 


a 


asym\ 


(Am) — + (y‘^T{X*, S), 


(14) 


where r(A*,S') = CA»V*{(C*(6*)}/S' if the noisy traditional gradient is used, andT{X* ,S) = 
CA»/F(A*)“^V*{(C*(6')}/i?(A*)“^/S' if the noisy natural gradient in ([HD is used. 


These results show that the variance of VEIL estimators increases linear 
(T^. For PMMH and IS^ estimators. 


Pitt et ah 


(1201211 and 


Tran et ah 


y with 


(1201311 show 


that their variances increase exponentially with This means that VEIL is useful 
in cases where only a rough estimate of the likelihood is available, or it is expensive 
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to obtain an accurate estimate of the likelihood. 

We now discuss the issue of selecting We note that under Assumption 2, 
N is tuned depending on 9 as N = N„2(9) = 7 ^( 6 ')/cr^, so th e time to c o mput e the 


likelihood e stima te PN{y\9) is proportional to l/cr^. Then, 


Pitt et ah 


Tran et ah 


(I2ni2h and 


( 2013 ) show that, for the PMMH and IS^ methods, the optimal cr^ that 


gives an optimal trade-off between the CPU time and the variance of the estimators 
is 1. For VEIL, the computing time can be defined as 


CT(a2) = 


(Am) 


( A ) 


asym \ 




+ t(A-,S), 


(15) 


where neither cr^yin(AM) nor t{X*,S) depends on These results suggest that 
should be set to a large value, as long as it is not too large for the stochastic search 
procedure in Algorithms 1 and 2 to converge. 


5 Applications 


5.1 Application to generalized linear mixed models and panel 
data models 


Generalized linear mixed models (GLMM) (see, e.g. 


Fitzmaurice et ah 


201 ll) . also 


known as panel data models, use a vector of random effects to account for the 
dependence between the observations i/i = {yijj = l,...,nj} measured on the same 
individual i. Given the random effects Oj, the conditional density p{yi\9,ai) belongs 
to an exponential family. The joint likelihood function of the model parameters 6 
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and the random effects a = (ai,...,a„), is tractable 


p{y,a\e) = Wp{ai\e)p{yi\e,ai). 

i=l 


Typically in the VB literatnre the joint posterior p{6,a\y)(xp{6)p{ii,a\6) is approxi¬ 


mated by a variational distribntion of the form q{0)q{a), and t 
approximation to the marginal posterior p{6\y). For example, 


len q(9) is use d as an 


Tan and Nott 


fl2ni3h 


take this approach but use partiall y non-centered parameteriz ations to reduce de¬ 


pendence between parameter blocks. 


Ormerod and Wandl (120121 ) consider frequentist 


estimation of 9, but using VB methods to integrate out a. As discussed in the 
introduction, factorization of the VB distribution generally ignores the posterior de¬ 
pendence between 6 and a, which often leads to underestimating the variance in the 
posterior distribution of 6. Below, we refer to such a VB method as classical VB. 

The likelihood, p{y\9) = YYI;^^p{yi\6) with p(j/i|6') = Jp{yi\6,ai)p{ai\6)dai an integral 
over the random effects «*, is in most cases analytically intractable but can be easily 
estimated unbiasedly using importance sampling. Let hi{ai\y,9) be an importance 
density for Oj. The integral p{yi\0) is estimated unbiasedly by 


1 

PNiiVil^) = Wi{ay’,9) = 

* i=i 


(i) ... m _ Pi.yM^\W(^'f’\^) (pad 


„b)i 


hi{,a^i\y,9) 


, (16) 


It is possible to use different Ni for each p{yi\6). Hence, Pj^{y\6) = Y\^^^pjq.{yi\6) is an 
unbiased estimator of the likelihood p{y\9). The variance of z = \ogpj^{y\6)—\ogp{y\6) 

is 

n 

N{z\e) = N{\ogpN{y\e)) = Y.N{\ogpNM0)). (17) 

i=l 
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which can be estimated by V( 2 ;| 6 ') = ^”^^V(logp 7 Vi( 2 /i| 6 ')) with 


Y (log pN.{yi\9)) = 


W) 

N, 




(J) 


, 7iW = 




- 1 . 


(18) 


Given a fixed it is therefore straightforward to target Y{z\6)=a‘^ by selecting iVj 
snch that Y{\ogpNi{yi\6)) zza'^/n. 


Six City data 


We no w illnstrate the VEIL algorithm using the Six City data in 


Fitzmaurice and Laird 


(119931) . The data consist of binary responses yij which indicate the wheezing status 


(1 if wheezing, 0 if not wheezing) of the ith child at time-point j, z = l,...,537 and 
j = l,...,4. Covariates are the age of the child at time-point j, centered at 9 years, and 
the maternal smoking status (0 or 1). We consider the following logistic regression 
model with a random intercept 


p(^yij\/3,ai) = Binomial (l,pij), 
logit(py) = Pi+P2-(^geij+PsSmokeij+ai, a*~A/'(0,r^). 

The model parameters are 6 = {/3,t‘^). We use a normal prior 7\^(0,50J3) for f3 and a 
Gamma(l,0.1) prior for 

We use the variational distribution qx{0) = q{(3)q{T'^), where q{(3) is a (i = 3—variate 
normal A/'(/i,S) and q{T^) is an inverse gamma distribution. We then run Algorithm 
2 , see the Appendix for the detail, with S = 1000 samples to estimate the gradient. 
The likelihood is estimated as in flTOll with the natural sampler hi{ai\y,6) = p{ai\6), 
which is the normal distribution A/'(0,r^) in our case. The cr^ in Section 0] is set to 4, 
which on average requires A’ = ^Aj/n=124 particles. Using a larger will lead to 
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Figure 2: Application to GLMM: Six City data 



too small Ni that makes the estimate in (ITS]) unreliable. Figure El^a) plots the scaled 
lower bounds over the iterations. 

We compare the performance of the classical VB and VBIL algorithms to the 


pseudo-marg inal MCMC simu lation flAndrieu and Robertsl . 


suggested in 


Pitt et ah 


2009). We set cr^ = 1 as 


(120121) . The MCMC chain , based on the adaptive random 


walk Metropolis-Bastings algorithm in 


Haario et ah! (120011) . consists of 20000 iterates 


with another 10000 iterates used as burn-in. 

Figure |2] plots the classical VB estimates (dashed line), MCMC estimates (dot¬ 
ted line) and the VBIL estimates (solid line) of the marginal posteriors p{/3i\y) and 
p(r^|l/). The MCMC density estimates are carried out using the kernel density esti¬ 
mation method based on the built-in Matlab function ksdensity. The hgure shows 
that the VBIL estimates are very close to the MCMC estimates. The classical VB 
underestimates the posterior variance of in this example. The clock times taken 
to run the VB, VBIL and MCMC procedures are 4, 2.9 and 505 minutes respec¬ 
tively. However, we note that the running time depends on many factors such as the 
programming language being used and the initialization of the procedures. All the 
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(a) (b) (c) 

Figure 3: Plots of scaled lower bounds over the iterations: (a) Six City example, (b) 
state space example, (c) ABC example 

examples in this paper are run on an Intel Core 16 i7 3.2GHz desktop supported by 
the Matlab Parallel Toolbox with 8 local processors. Obviously, the more processors 
we have, the faster the VEIL procerdure is. 


Large data example 


One of the main advantages of VEIL is its scalability, i.e. it is applicable in large data 
cases. This section describes a scenario where it is difficult to use the PMMH and IS^ 
methods. Consider a large data case with a large number of panels n. From (ITTll . for 
hxed Aj, the variance of the log-likelihood estimator V( 2 ;| 6 ') increases linearly with n. 
Therefore, when n is large enough, the PMMH and IS^ methods will not work in a 


2011 1. In this 


practical sense, because V(z| 6 *) can be very large (IFlurv and Shephardl. 

GLMM setting, PMMH and IS^ do not work when V(z|0) is as large as 6 or 7. One 
can decrease V( 2 :| 6 *) by increasing A*, but this can be too computationally expensive 
to be practical. 

We generate a data set of n = 3000 from the following logistic model with a random 
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intercept 


p{yij\l3,ai) = Binomial (19) 

logit(py) = (3i + (32Xij + ai, aj~iV( 0 ,r^), i = 1 ,n, j = 1 ,n*, 

with /5 = (—1.5,2.5)', r^ = 1.5, ni = 5, ~17(0,1). It takes, on average across different 

6, 30 seconds to carry ont each likelihood estimation with the numbers of particles 
Ni tuned to target V( 2 ;| 6 ') = l, which requires iV = = 3855 particles. So if an 

optimal PMMH procedure was run on our computer to generate a chain of 30000 
iterations as done in the Six City data example, it would take 10.4 days. We now 
run VEIL with set to 30, which on average requires N = 'Y^Ni/n= 187 particles 
and 0.7 seconds for each likelihood estimation. The VEIL procedure stopped after 15 
iterations with the clock time taken was 23 minutes. Figure H] plots the variational 
approximations of the marginal posteriors, which are bell shaped as expected with a 
large dataset. 





Application to GLMM: large data 
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5.2 Application to state space models 

In state space models, the observations yt are observed in time order. At time t, the 
distribntion of yt conditional on a state variable Xt is independently distributed as 

yt\xt ~ gt{yt\xt,6), 

and the state variables {xt}t>i are a Markov chain with 

xi ~ fidi-), xt\xt_i ~ ft{xt\xt-i, 9). 

The likelihood of the data y = yi-T is given by 

p{y\9) = J p{y\x,9)p{x\9)dx (20) 


with x = Xi,T and 


T 


p{x\e) = pe{.xi)Wft{xt\xt-i,9), p{y\x,e) = W_gt{,yt\xu9). 


t=2 


t=l 


Given a value of 6*, th e likelihood v{v\6) can be 


sampling estimator flShenhard and 


’itt 


particle hlter estimator fiDel Morai 
number of particles. 


1997- 


2004; 


unbiasedly estimated by an im portance 


Durbin and Koopmanl. 


Pitt et ah 


19971) or by a 


2 OI 2 I ). PN{y\9)-, with N the 


An important example of state space models is the stochastic volatility (SV) 
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model. The time series data yt is modeled as 


yt = exp(a;4/2)w;t, ~ AA(0,1), 

xt = yi + 0(xt_i - /i) + avt, xi ~ 7\/'(/i, 


a 


1 - 


-), vt ~ 7 ^( 0 , 1 ), 


with p G M, 0 G (—1,1) and cx^ > 0. Let r = (l + (^)/2 G (0,1); we will esti mate r but 


report results for (j). The model parameters are 9 = We follow 


Kimetal 


fjl998h and use a normal prior A/'(0,10) for p, a Beta prior 5(20,1.5) for r and an 
inverse gamma IG(2.5,0.025) for a^. 

To illustrate the VEIL algorithm for state space models, we analyze the week¬ 
day close exchange rates rt for the Australian Dollar/US Dollar from 5/1/2010 to 
31/12/2013. The data are available from the Reserve Bank of Australia. The data yt 

is 


yt = 100 log 


D+i 

n 




D+1 


, t = 1,...,T = 1001. 


2=1 


We use the variational distribution q\{0) =q{y.)q{T)q{a‘^), where g(p) is 7\/'(p^,cT^), g(r) 
is Beta(ar,/9r) and qi^cr^) is inverse gamma IG(q!o-2,/7o. 2). We employ the constraint 
a,- > 1 and /S,- > 1 to make sure that g(r) has a mode. The likelihood estimator 
PN{y\9) is computed by a basic particle filter. We then run the VEIL algorithm with 
S = 1000 samples, starting with = 0, = 0.3, a,- = 95, j3r = 5, a ^2 = 11, ( 3^2 = 1. 

This initial point is set so that the initial mean values of p, 0 and are 0, 0.9 and 
0.1 respectively, which is pretty far away from the posterior means; see Figure |5l The 
VEIL algorithm stops after 28 iterations. Figure [31(b) plots the scaled lower bounds 
over the iterations. 

The VEIL is compared to pseudo-marginal MGMG simulation, based on an adap¬ 
tive random walk Metropolis-Eastings algorithm, with 100,000 iterations starting 
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from the same values /i = 0, r = 0.95 and cr^ = 0.1. The number of particles used in 
MCMC is fixed at A^ = 300, so that V(pAr(?/|i9))!^l at the initial value (9 = (0,0.95,0.1). 
The number of particles used in VEIL is hxed at = 100 as the use of randomised 
QMC for generating 6 helps reduce greatly the variance in estimating the gradient. 
We fix N in this example as it is difficult to estimate the variance of log-likelihood 
estimates obtained by the particle hlter. 

Figure [5] plots the MCMC estimates (dotted line) and the VEIL estimates (solid 
line) of the marginal posteriors. The hgure shows that the VEIL estimates are close 
to the MCMC estimates but consume significantly less computational resources. The 
CPU times taken to run the VEIL and MCMC procedures are 0.7 and 28 minutes 
respectively. 



Figure 5: Application to state space models: Exchange rate data 
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5.3 Application to ABC 


In many modern applications, such as in genetics flTavare et al.l. ll997D . we either 
cannot evaluate the likelihood p{y\0) pointwise or do not wish to do so, but we can 
sample from i t , i.e. we can simulate y'^p{-\0)- Approximate Bayesian computation 


(ITavare et ah 


199711 approximates the likelihood by 


Plf(!/I^) = / K^(S(y'),S{y))p(y'\e)dy', 


( 21 ) 


where is a. kernel with the bandwidth e and S{.) is a vector of summary statis¬ 

tics. Inference is then based on the approximate posterior PABc{^\y) o(.p{^)PLF{y\0)■ 
Because the likelihood-free function Plf(?/| 6') can be unbiasedly estimated by 

1 ^ 
i=l 


it is straightforward to use the VBIL algorithm to approximate Pabc(^|?/)- 

We illustrate the application of the VBIL al gorithm to A BC by using it to £t 


an a-stable distribution, a-stable distributions flNolanl. 


200711 are a class of heavy¬ 


tailed distributions used in many statistical applications. An a-stable distribution 
iS(a,/5,7,5) is parameterized by the stability parameter Q!G(0,2), skewness /3g(— 1,1), 
scale 7 > 0 and location 5 G M. The main difficulty when working with a-stable 
distributions is that they do not have closed form densities, which makes it difficult 


to do inference. However, as it is easy to sam ple from an g-stab 


use ABC techniques for Bayesian inference fjPeters et ah 


e distribution, one can 


201211 . We illustrate in this 


example that VBIL provides an efficient approach for fitting a-stable distributions. 

We generate a data set y with n = 500 observations from a univariate a-stable 
distribution 5(1.5,0.5,1,0). Let Qp be the pth quantile of a pseudo-data set y' generated 
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from iS(a,/3,7,5). We follow IPeters et al.l f)2012[l and use the summary statistics S{y') = 
(ya,vy,Vj,vs) with 


^ %.95 ~ Qo.05 ^ %.95 + Qom — 2go.5 ^ <10.75 ~ <10.25 ^ 1 / 

v„ = - -- , Vy = -- , Vs = -2_^ Vi- 


<10.75 — <10.25 


<10.95 — <l0.05 


7 


n 


2=1 


For the obse r ved d ata ?/, the parameter 7 in rL is estimated using McCulloch’s method 


f McCulloch! . 


19861) . As the parameterization is discontinuous at a = l, resulting in 


poor estimates of the summary statistics, we consider the case w ith a > 1 and restrict 


Peters et ah 


(I2OI2I). 


the support of a to the interval ( 1 . 1 , 2 ) as in 
We reparameterize 


5 = iog( ^^ \'^ ) e M, /I = log(^^) e M, 7 = log(7) e R, 5 = 5 e R, 


a 




and estimate 9 = {a,(3,'^,6) but report the results for {a,(3,'y,6). We use a normal prior 
0 ~A/'( 0 , 100 J 4 ) and approximate the posterior p{6\y) by a normal variational distri¬ 
bution qx{9) = N. One can work with the original parameterization {a,(3,'y,6) 
and use some form of factorization q{(y)q{(3)q{'y)q{6). We choose to work with 9 to 
account for the posterior dependence between the parameters. This also illustrates 
the flexibility of the VEIL method in the sense that it can be used without requiring 
factorization. 

We use the Gaussian kernel with covariance matrix O.OI/4 for the likelihood-free 
PLF{y\9) in (EH). The VEIL is compared to pseudo-marginal Metropolis -Hastings 


meth ods with 20,000 iterations after 5000 burnins. For the standard PMMH flAndrieu and Roberts! . 


20091) . the number of pseudo-data sets iV = 20 is selected set after some trials in order 


to have a well-mixing chain. Efficient versions of PMMH has been proposed recently, 
which are more tolerant of noise in the likelihood estimates. Here we compare VEIL 
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to the blockwise PMMH method of iTran et al.l ( 20161) . For the blockwise PMMH, we 
set N = 5. We also use this value of N in VEIL. Table [U shows the VEIL and MCMC 
estimates, and the CPU times. As shown, VEIL is orders of magnitude faster than 
MCMC in this example. Figure [3]^c) plots the scaled lower bounds over the iterations. 



True 

Standard PMMH 

Elockwise PMMH 

VEIL 

a 

1.5 

1.57 (0.15) 

1.58 (0.14) 

1.57 (0.11) 


0.5 

0.46 (0.21) 

0.45 (0.21) 

0.48 (0.16) 

7 

1 

1.04 (0.12) 

1.04 (0.12) 

1.02 (0.12) 

(5 

0 

-0.08 (0.21) 

-0.09 (0.18) 

-0.08 (0.14) 

CPU time (min) 


12.56 

7.62 

0.12 


Table 1: AEC example: Standard PMMH, blockwise PMMH and VEIL estimates 
of a, f3, 7 and 6. The numbers in brackets are estimates of the posterior standard 
deviations. 


5.4 Using VEIL to improve marginal posterior estimates 

A drawback of VE methods in general is that the factorization assumption as in 
([Toll ignores the posterior dependence betw een the factor s , whi ch might lead to poor 


approximations of the posterior variances (iNeville et ahl. 


2014) . We now show how 


the VEIL algorithm can be used to help overcome this problem. 

Suppose that we would like to have a highly accurate VE approximation to the 
marginal posterior p{9^^^y). We restrict ourselves to the case with a tractable like¬ 
lihood for simplicity, but the following discussion also applies when the likelihood is 
intractable. The likelihood of 




( 22 ) 


with 9^'^^^ = (9^^\...,9^^ is in general intractable but can be estimated 

unbiasedly. Let be an approximation to the marginal posterior p{9^'^^^\y) 














resulting from a classical VB method that uses the factorization ffTOj) . The integral 
in fl2^ can be estimated unbiasedly using importance sampling with the proposal 
density or a tail-flattened version of it. This is accurate enough in practice 

because VEIL does not require a very accurate estimate of p{y\9^^'^) as discussed 
in Section |H The VEIL algorithm can then be used to approximate the marginal 
posterior p{9^^^y) directly with integrated out. The resulting approximation is 
often highly accurate as the dependence between 9^^^ and 9^'^^'^ is taken into account. 

A formal justihcation is as follows. We use the notation as in flTOll and write 
A = Suppose that we estimate the marginal posterior of A^-^^ by qxU){9^^^) 

which belongs to a family J^={qx(j){9^^'>),X^^'^ ^A}. VEIL proceeds by minimizing 

over A^-^^ G A. Let be the VEIL estimator. Under Assumptions 1 and 2 or when 
the number of samples N used to estimate (j2^ is large enough, aI'^^ is guaranteed to 
be a minimizer of KLj(A*'-^^). Assume further that KLj(A*''^^) is convex, then 

KL^(A1^’^) < KLj{X^^^) for all A^^'^ G A. (23) 

If we use a VB procedure with a factorization of the form q\{9)=qxu){9^^'*)qx(\j){9^'^^^) 
where qxu){9^^'^) belongs to the same family J^, then VB proceeds by minimizing the 
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KL divergence 


KL(A0',A<'^I) = j q.u, (f>'^'>)q,cy„ log 
= /?.«.(«“) log 

+ / </«.(««') / 

= KL,.(A“) + j ?,o,(9<-’') j toj,(9WI)log-||Hi||^<(9(V)d0ii^ 

Let be a minimizer of fl2^ . Because of the decomposition in flM|) . the 

estimator A*--^^ is not necessarily the minimizer of KLj(A*'-^^). From fl2^ . 

KL,(Al^)) <KL,(A(^)). (25) 

So the VEIL estimator aI'^^ is no worse than the factorization-based VB estimator 
in terms of KL divergence. 

We illustrate this application by generating n= 100 observations from a univariate 
mixture of two normals 


p{x) = UjAf{x\lJ.i, CTi) + (1 - uj)Af{x\p 2 , 0 - 2 ) 


with cj = 0.3, jui = —3, /i 2 = 3, cr^ = 2 and cr^ = 3. Suppose that we are interested 
in getting an accurate variational approximation of the posterior p(ujy). Getting 
an accurate estimate of w is often more challenging than the other parameters. We 
use diffuse priors a;~f/(0,l), /xi ~ A/'(0,100), /i 2 ~ A/'(0,100), o'f ~ and cr^ ~ 


a. 


2\-l 


, and run VEIL to approximate v (t jlv) b y a Beta distribution. We use the VB 


algorithm of 


McGrorv and Titterinetonl (120071) . in which the variational distribution 
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is factorized as q{u)q{a‘f,a 2 )q{fii,fi 2 Wi,(^ 2 )^ to design the proposal density to obtain 
an importance sampling estimator of p{y\u). 

Figure [6] plots the McGrory-Titterington estimate (dashed line) and VEIL esti¬ 
mate (solid line) of the posterior p{u\y). As shown, the VEIL estimate has heavier 
tails than the VE estimate. Ey fl25|) . it follows that the difference between the two es¬ 
timates gives an indication of the extent to which the McGrory-Titterington estimate 
is suboptimal. This example shows that the VEIL method provides an attractive way 
to obtain accurate approximation of marginal posteriors. 



Figure 6: Plots the VE (dashed line) and VEIL estimates (solid line) of the posterior 
p{uj\y). 

6 Conclusion 

We have proposed the VEIL, a useful VE algorithm for Eayesian inference in statis¬ 
tical modeling where the likelihood is intractable. The method makes it possible to 
do inference in statistical models using VE in some situations where that was previ¬ 
ously impossible. The main advantage of VEIL over its competitors, such as PMMH 
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and IS^, is its scalability. We show in the examples that VEIL is several orders of 
magnitnde faster than these existing methods. 
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Appendix 


Proof of TheoremUl (i) Under Assnmptions 1 and 2, we have that 


KL(A) = KL(gdk) 


qx{9)E{z\e)d9 = KL(gA||vr) + 



(26) 


where KL(gA||7r) is the Knllback-Leibler divergence between the variational distribn- 
tion q\{9) and the posterior 7i{9). So, VaKL(A) = VAKL(gA||7r) is independent of 
and minimizing KL(A) with respect to A is eqnivalent to minimizing KL(gA||7r). 
Algorithm 1 and 2 are the Robbins-Monro procednre for finding the root A* of the 
eqnation VAKL(gA||7i') = 0. Then, flT^ follows from Theorem 1 of 
the constant ca* independent of 


Sacks! f 1958!) with 


(ii) Denote h{9,z) = \og{p{9)pf^{y\9,z)) = \og{p{9)p{y\9)) + z = h{9) + z. We consider 
the case with the noisy traditional gradient in ([6]); the proof for the other cases is 
similar. We denote by VaKL(A*) the noisy gradient obtained when the likelihood is 
available. Then, noting that E*((^*(6')) =0, the constant c in ([7j) is 


Ee,^{a9f{\ogqx*{9) - h{9) - z)} ^ EdC(^)^(loggA*(6') - h{9))} ^ ^ ~ ^ 
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We note that c is the control variate constant we would use to compute VaKL(A*) if 
the likelihood was known. 


V(VaKL(A*)) 


^V 0 ,^|C*( 6 ')(loggA*( 6 ') - h{d) - z-c)^ 

2 1 2 

^V.{C.(«)} + ^V.{c.(»)(logto.(») - h(ff) + y - c)} 

yV.{c.(9)} + iv.{c.(9)(loggA-(9) - 9(9) -?)} 

2 

yV.{c.(9)}+V(WKL(A*)). 


Therefore, 

2 

<„(h,) = Ca.V(v);KL(A-)) = + CA.yV,{c.(9)}. 


□ 


Derivation for Section 15.1 

The density of the d—variate normal is 




(27r)'^/2|S|V2 


exp 




A simplihed form of the inverse Fis her ma t rix fo r a multivariate normal under the 


natural parameterization is given in 


Wandl (j2014il . For a. dxd matrix A, denote by 


vec(A) the d^-vector obtained by stacking the columns of A, by vech(A) the |d(d+l)- 
vector obtained by stacking the columns of the lower triangular part of A. The 
duplication matrix of order d, Dd, is the d^ x id(d+l) matrix of zeros and ones such 
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that for any symmetric matrix A 


Ddvech(A) = vec(y4). 

Let be the Moore-Penrose inverse of Dd, and vec“^ be the inverse 

of the operator vec. Then, the exponential family form of the normal distribntion 
g(/3) is g(/3)=exp(T(/3y\-Z(A)) with 


/S 

X = 

Ai 


1 

7 

w 

vech{/3/3') 


1 

(M 

<< 


-iD>ec(E-‘) 


T(/?) = 


The usnal mean and variance parameterization is 


(27) 




Wandl (120141 ) derives the following very nsefnl formnla 


If(^) - 


-S-^M S-^ 


with 


M = 2Zl+(/i (g) h) and ^ = 2Z1+(S O S)Z1+', 


(28) 


where ® is the Kronecker prodnct and Id the identity matrix of order d. The gradient 
VA[logg(/3)] is 

(3-fx 

vech(/5/l' — S — /i/i') 


VA[logg(/l)] = 


(29) 
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For the inverse gamma distribution q'(r^) with density 






exp(-Vr"), 


the natural parameters are (a,6). The Fisher information matrix for the inverse 
gamma is 

fv.„|logr(a)| -1/6^ 

Ip{a,b) = 

^ — 1/6 a/b"^ ^ 

and the gradient 

Va[loggA(^)] = -log(r^)+ log(6) - Va[logF(a)] 

v4iog,,(«)] = -4 + 1^ 

b 
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